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Abstract 

We present a method which tests whether or not two datasets (one of which could 
be Monte Carlo generated) might come from the same distribution. Our method 
works in arbitrarily high dimensions. 
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1 Introduction 



Many sciences today rely heavily on the use of Monte Carlo simulation. In 
High Energy Physics (HEP) for example it is used in practically every stage 
of an experiment from the design of the detectors to the final analyses. This 
brings up the question of the precision of the MC simulation. In other words, 
how close are the probability distributions of the MC to those of the experi- 
mental data? Testing whether two datasets come from the same distribution 
is a classical problem in Statistics, and for one- dimensional datasets a large 



number of methods 
proposed by Bickel 



rave been developed. Tests in 



ligher dimensions have been 



l| and Friedman and Rafsky 2| . Zech 



3] discussed a test 



based on the concept of minimum energy. The test proposed here belongs to 
a class of consistent, asymptotically distribution free tests based on nearest 



neighbors (Narsky 



4[, Bickel and Breiman |5], Henze 6[, and Schilling 



3)- 



We concentrate in this paper on comparing two datasets, either both real data 
or one real and one Monte Carlo, but the proposed method also allows us to 
test whether a dataset comes from a known theoretical distribution as long as 
we can simulate data from this distribution. 
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2 The Method 



To set the stage, let's say we have observations (events) Xi, ..,X n from some 
distribution F, and observations Yi,..,Y m from some distribution G. In the 
application we have in mind one of these would be MC generated data and the 
other "real" data, but this is not crucial for the following. We are interested in 
testing H : F = G vs H a : F ^ G. The idea of our test is this: let's concentrate 
on one of the X observations, say Xj. What is its nearest neighbor, that is, the 
observation closest to it? If the null hypothesis is correct and both datasets 
were generated by the same probability distribution, then this nearest neighbor 
is equally likely to be one of the X or Y observations (proportional to n and 
to). If F ^ G there should be regions in space where there are relatively more 
X or Y observations than could be expected otherwise. 

More formally, let Zj = 1 if the nearest neighbor of Xj is from the X data 
and otherwise. Then, under the null hypothesis, Zj is a Bernoulli random 
variable with success probability (n — l)/(n + m — 1). Therefore Z = J2]=i Zj 
has an approximate binomial distribution with parameters n and (n — l)/(n + 
to — 1) . The distribution is only approximately binomial because the Z'-s are 
not independent, but for datasets of any reasonable size the dependence is 
very slight and can be ignored. 

There is an immediate generalization of the test: instead of just considering the 
nearest neighbor we can find the k- nearest neighbors. Now Zj = (Zj 1 , .., Zj k ) 
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with Zji = 1 if the i th nearest neighbor of Xj is from the X dataset, other- 
wise. Under the null hypothesis Z = Y^j=i YJi=\ Zji has again an approximate 
binomial distribution with parameters nk and (n — l)/(n + m — 1). (Actually, 
Yli=i Zji has a hypergeometric distribution, but because we will use a k much 
smaller than n or m the difference is negligible.) 

We can find the p-value of the test with 

p = p(y > Z) 

where V ~ Bin(nk, (n — l)/(n + m — 1)). 

Especially if n and m are small or if a relatively large k is desired it would be 
possible to use a permutation type method to estimate the null distribution. 
The idea is as follows: under Hq all the events come from the same distribu- 
tion, so any permutation of the events will again have the same distribution. 
Therefore if we join the X and Y events together, shuffle them around and 
then split them again into n and m events X' and Y', these are now guaran- 
teed to have the same distribution. Applying the k-nearest neighbor test and 
repeating the above many times (say 1000 times) will give us an estimate of 
the null distribution. For more on the idea of permutation tests, see Good 8]. 
This method will achieve the correct type I error probability by construction, 
but it also requires a much greater computational effort. 

There are a number of choices to be made when using this method. First of all, 
there is the question of which dataset should be our X data. Clearly, if n = m, 



this does not matter but it might well otherwise. Indeed, in our application of 
comparing MC data to real data, we have control of the size of the MC data 
although sometimes there are computational limits on its size. 

Next we need to decide on k. Again there is no obviously optimal choice here. 
Finally, we need to decide on a metric to use when finding the nearest neighbor. 
If the observations differ greatly in "size" in different dimensions, the standard 
Euclidean metric cannot be expected to work well because small differences in 
dimensions with a large spread would overwhelm small but significant differ- 
ences in dimensions with a small spread. This last issue we will deal with by 
standardizing each dimension separately, using the mean and standard devia- 
tion of the combined X and Y data. If the data come from a distribution that 
is severely skewed, other measures of location and spread (such as the median 
and the interquartile range) could also be used to standardize the data. In 
the next section we will show the results of some mini MC studies which give 
some guidelines for the choices. 



3 Performance of this Method 

If we use the binomial approximation in our test, we will need to verify that 
the method works, that is, that it achieves the desired type I error probability 
a. Of course, we are also interested in the power of the test, that is, the 
probability to reject the null hypothesis if indeed F ^ G. In this section we 



will carry out several mini MC studies to investigate these questions. 

We start with the situation where there exist other methods for this test, 
namely in one dimension. For comparison we will use a method that is known 
to have generally very good power, the Kolmogorov-Smirnov (KS) test. In the 
first simulation we generate n = m observations from 1000 to 20000 in steps of 
1000 each for X and Y from the uniform distribution on [0, 1]. Because of the 
probability integral transform this is actually a very general case, and similar 
conclusions will hold for all other continuous distributions in one dimension. 
For each of these cases we use our test with k — 1, 2, 5 , 10 , 20 , 50 and 100 as 
well as the KS test. This is repeated 10000 times. Figure 1 shows the results, 
using a nominal type I error probability of a = 5%. 

As we see the true type I error probability is close to nominal but increases 
slowly as k increases. This is due to the lack of independence between the Z/s. 
Especially if k is large relative to n or m , the true type I error probability is 
larger than what is acceptable. Based on this and similar simulation studies, 
we recommend k — 10 if both n and m are at least 1000, otherwise k = 
0.01 min(n, m). Alternatively, one can use the permutation method described 
above which will have the correct type I error by construction. 

Even for k = 1 we have a slightly higher than nominal type I error probability, 
about 5.5% if a = 5%. This is partly due to the dependence between the Z/s, 
and partly to the discrete nature of the binomial distribution. For example, 
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if we defined the p- value as P(V > Z — 1) we would get a true type I error 
probability slightly smaller than the nominal one. In any case, we believe this 
difference to be acceptable. 

Next a simulation to study the power of the test, again in one dimension. We 
use the uniform on [0, 1] distribution for F and the uniform on [0, 9} for G, 
where 9 goes from 1 to 1.1 We generate n = m = 1000 events and apply our 
test with k — 1, 5, 15 and 25 as well as the KS test. This is repeated 10000 
times. The result is shown in Figure 2. Clearly, the higher the k the better the 
power of the test. In fact, for this example already with k — 15 the test has 
better power than the KS test! Generally speaking, in one dimension, our test 
has power somewhat inferior to the KS test. However, our test is not meant 
to be used in one dimension. It is encouraging to find that it does fairly well 
even in that situation. 

How should one choose the size of the Monte Carlo dataset relative to the size 
of the true data? In the next simulation we generate n = 1000 events from a 
uniform [0, 1] and assume this to be the real data. Then we generate m events 
from the same distribution, with m going from 50 to 2500. In Figure 3 we show 
the results which indicate that the MC dataset should have the same size as 
the real data because in that case the true type I error probability is about the 
same as the nominal one. This agrees with general statistical experience which 
suggests that, in two-sample problems, equal sample sizes are often preferable. 
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Finally we present a multi-dimensional example. We generate n = m = 1000 
events. The F distribution is a multivariate standard normal in 9 dimensions, 
and the G distribution a multivariate normal in 9 dimensions with means 0, 
standard deviations 1 and correlation coefficients cor(Xi, Xj) = p if = 1 

and if \i — j\ > 1, where p goes from to 0.5. This example illustrates 
the need for a test in higher dimensions. Here all the marginals are standard 
normals, and any one-dimension-at-a-time method is certain to miss the dif- 
ference between F and G. This is shown in Figure 4 where with k = 10 we 
reject the null hypothesis quite easily (at p = 0.5) whereas the KS test applied 
at any of the marginals fails completely. 



4 Implementation 



A C++ routine that carries out the test is available from one of the authors 



at http://math.uprm.edu/~wrolke/. It allows the use of the binomial approx- 
imation as well as the permutation method. It uses a simple search for the 
k-nearest neighbors. k-NN searching is a well known problem in computer 
science, and more sophisticated and faster routines exist and could also be 
used in combination with our code. (See, for example, Friedman, Baskett and 
Shustek [9J .) 



5 Summary 



We describe a test for the equality of distributions of two datasets in higher 
dimensions. The test is conceptually simple and does not suffer from the curse 
of dimensionality. Simulation studies show that it approximately achieves the 
desired type I error probability, or does so exactly at a higher computational 
cost. They also show that this test is capable to detect differences between 
the distributions only "visible" in higher dimensions. 
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Fig. 1. The true type I error probability in a one dimensional example as a function 
of sample size. X and Y have a uniform distribution on [0,1]. n = m, and the 
nominal type I error probability is 5%. Each curve represents a different k value. 
The KS test gives a flat line at 5%. 
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Fig. 2. F = U[0,1], G = U[0,9] where 6 goes from 1 to 1.1 We generate 
n = m = 1000 events, k = 1, 5, 15 and 25. 
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Fig. 3. n = 1000 from F = U[0, 1], m goes from 50 to 2500, G = F. 
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Fig. 4. n = m = 1000. The F distribution is a multivariate standard normal in 9 di- 
mensions, and the G distribution a multivariate normal in 9 dimensions with means 
0, standard deviations 1 and correlation coefficients cor(Xi, Xj) = p if \i — j\ = 1 
and if \i — j\ > 1, where p goes from to 0.5. The multidimensional test has much 
better power than any of the one-dimensional KS tests. 
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